Replica-molded electro-optic polymer Mach–Zehnder modulator
A Mach–Zehnder electro-optic polymer amplitude modulator is fabricated by a simple, high-throughput soft-stamp replica-molding technique. The modulator structure incorporates the highly nonlinear and stable chromophore AJL8, doped in amorphous polycarbonate. Single-arm phase retardation yields a half-wave voltage (Vπ) of 8.4 V at 1600 nm. The on/off extinction ratio is better than 19 dB, resulting from precise Y-branch power splitters and good waveguide uniformity. These results indicate that the simple fabrication process allows for good optical performance from high-fidelity replicas of the original master devices.
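As a reading aid (not from the paper), the ideal Mach–Zehnder intensity transfer function shows what the reported half-wave voltage means: driving one arm by Vπ switches the output from maximum to minimum transmission.

```python
import math

def mzm_transmission(v, v_pi):
    """Normalized intensity transfer of an ideal single-drive Mach-Zehnder
    modulator: T(V) = cos^2(pi * V / (2 * V_pi))."""
    return math.cos(math.pi * v / (2.0 * v_pi)) ** 2

v_pi = 8.4  # reported half-wave voltage in volts at 1600 nm
on = mzm_transmission(0.0, v_pi)    # constructive interference: T = 1
off = mzm_transmission(v_pi, v_pi)  # destructive interference: T -> 0
```

In a real device, splitter imbalance and waveguide nonuniformity bound the achievable on/off contrast; the reported >19 dB extinction corresponds to an off-state leakage below about 1.3% of the on-state power.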
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video . Comment: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of all submissions).
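The output-score use of captions described above can be sketched as a late fusion of the two matching branches. This is a minimal illustration; the fusion weight `alpha` and the similarity values are illustrative assumptions, not the paper's settings.

```python
def fused_score(query_video_sim, query_caption_sim, alpha=0.5):
    """Late fusion of the Query-Video and Query-Caption matching branches.
    `alpha` is an illustrative weight, not the paper's value."""
    return alpha * query_video_sim + (1.0 - alpha) * query_caption_sim

# Toy ranking of two candidate videos for one text query:
scores = {
    "video_a": fused_score(0.62, 0.40),  # strong visual match, weak caption match
    "video_b": fused_score(0.55, 0.70),  # caption branch lifts this candidate
}
best = max(scores, key=scores.get)
```

The point of the sketch: a candidate whose generated caption matches the query well can overtake one that wins on visual similarity alone.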
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at
https://github.com/whwu95/BIKE . Comment: Accepted by CVPR 2023.
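The parameter-free temporal saliency idea can be sketched as similarity-weighted pooling: frames that align with the text embedding dominate the video representation. Feature shapes and the temperature below are illustrative assumptions, not the paper's settings.

```python
import math

def temporal_saliency_pool(frame_feats, text_feat, temperature=0.01):
    """Parameter-free temporal pooling in the spirit of BIKE's Temporal
    Concept Spotting: each frame is weighted by its softmax-normalized
    similarity to the text embedding, then frames are averaged."""
    sims = [sum(f * t for f, t in zip(frame, text_feat)) for frame in frame_feats]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(text_feat)
    return [sum(w * frame[d] for w, frame in zip(weights, frame_feats))
            for d in range(dim)]

# A frame aligned with the text dominates the pooled representation:
pooled = temporal_saliency_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

No learned parameters are involved: the text encoder's output alone decides which time steps matter, which is what "parameter-free" means here.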
Broadband energy-efficient optical modulation by hybrid integration of silicon nanophotonics and organic electro-optic polymer
Silicon-organic hybrid integrated devices have emerging applications ranging
from high-speed optical interconnects to photonic electromagnetic-field
sensors. Silicon slot photonic crystal waveguides (PCWs) filled with
electro-optic (EO) polymers combine the slow-light effect in PCWs with the high
polarizability of EO polymers, which promises the realization of
high-performance optical modulators. In this paper, a broadband,
power-efficient, low-dispersion, and compact optical modulator based on an EO
polymer-filled silicon slot PCW is presented. A small voltage-length product of Vπ·L = 0.282 V·mm is achieved, corresponding to an unprecedented record-high effective in-device EO coefficient (r33) of 1230 pm/V. Assisted by a backside gate voltage, a modulation response up to 50 GHz is observed, with a 3-dB bandwidth of 15 GHz, and the estimated energy consumption is 94.4 fJ/bit at 10 Gbit/s. Furthermore, lattice-shifted PCWs are utilized to enhance the optical bandwidth by a factor of ~10 over other modulators based on non-band-engineered PCWs and ring resonators. Comment: 12 pages, 4 figures, SPIE Photonics West Conference 201
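The voltage-length product quoted above lets one trade drive voltage against device length. A back-of-the-envelope sketch (ideal linear scaling assumed; the electrode lengths are illustrative, not the paper's):

```python
# Reported voltage-length product of the slot-PCW modulator:
V_PI_L = 0.282  # V*mm

def half_wave_voltage(length_mm):
    """Half-wave voltage of a phase shifter of the given length,
    assuming V_pi scales inversely with electrode length."""
    return V_PI_L / length_mm

v_1mm = half_wave_voltage(1.0)   # 0.282 V for a 1 mm electrode
v_05mm = half_wave_voltage(0.5)  # 0.564 V for a 0.5 mm electrode
```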
PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement
Dominant Person Search methods aim to localize and recognize query persons in
a unified network, which jointly optimizes two sub-tasks, i.e., detection and
Re-IDentification (ReID). Despite significant progress, two major challenges
remain: 1) Detection-prior modules in previous methods are suboptimal for the
ReID task. 2) The collaboration between two sub-tasks is ignored. To alleviate
these issues, we present a novel Person Search framework based on the Diffusion
model, PSDiff. PSDiff formulates the person search as a dual denoising process
from noisy boxes and ReID embeddings to ground truths. Unlike existing methods
that follow the Detection-to-ReID paradigm, our denoising paradigm eliminates
detection-prior modules to avoid the local-optimum of the ReID task. Following
the new paradigm, we further design a new Collaborative Denoising Layer (CDL)
to optimize detection and ReID sub-tasks in an iterative and collaborative way,
which makes two sub-tasks mutually beneficial. Extensive experiments on the
standard benchmarks show that PSDiff achieves state-of-the-art performance with
fewer parameters and elastic computing overhead.
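The dual denoising process can be caricatured as joint iterative refinement of boxes and ReID embeddings. The linear update below is purely illustrative; in the actual Collaborative Denoising Layer the two sub-tasks condition on each other rather than converging independently.

```python
def refine_step(box, embed, box_target, embed_target, rate=0.5):
    """One toy refinement step: both the box estimate and the ReID
    embedding estimate move toward their ground truths. A stand-in for
    PSDiff's CDL, not its actual update rule."""
    box = [b + rate * (t - b) for b, t in zip(box, box_target)]
    embed = [e + rate * (t - e) for e, t in zip(embed, embed_target)]
    return box, embed

# Start from a noisy box and a noisy embedding, refine iteratively:
box, embed = [0.0, 0.0, 1.0, 1.0], [0.0, 0.0]
box_t, embed_t = [0.2, 0.2, 0.8, 0.8], [1.0, -1.0]
for _ in range(10):
    box, embed = refine_step(box, embed, box_t, embed_t)
```

The number of refinement steps is what gives the method its "elastic computing overhead": more iterations buy accuracy at inference time without changing the parameters.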
Uformer: A Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation
Complex spectrum and magnitude are considered as two major features of speech
enhancement and dereverberation. Traditional approaches always treat these two
features separately, ignoring their underlying relationship. In this paper, we
propose Uformer, a Unet based dilated complex & real dual-path conformer
network in both complex and magnitude domain for simultaneous speech
enhancement and dereverberation. We exploit time attention (TA) and dilated
convolution (DC) to leverage local and global contextual information and
frequency attention (FA) to model dimensional information. These three
sub-modules contained in the proposed dilated complex & real dual-path
conformer module effectively improve the speech enhancement and dereverberation
performance. Furthermore, hybrid encoder and decoder are adopted to
simultaneously model the complex spectrum and magnitude and promote the
information interaction between two domains. Encoder decoder attention is also
applied to enhance the interaction between encoder and decoder. Experimental results show that Uformer outperforms all state-of-the-art time-domain and complex-domain models, both objectively and subjectively. Specifically, Uformer reaches 3.6032 DNSMOS on the blind test set of the Interspeech 2021 DNS Challenge, outperforming all top-ranked models. We also carry out ablation experiments to quantify the contribution of each proposed sub-module. Comment: Accepted by ICASSP 2022.
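The dual-domain modeling can be sketched per time-frequency bin: one branch estimates a magnitude mask, the other a complex ratio mask, and the two estimates are fused. The fusion rule below (average the two magnitudes, keep the complex branch's phase) is an illustrative assumption, not the paper's exact formulation.

```python
import cmath

def fuse_bin(noisy, mag_mask, complex_mask):
    """Fuse magnitude-domain and complex-domain estimates for one
    time-frequency bin of the noisy spectrum (toy dual-domain fusion)."""
    complex_est = noisy * complex_mask   # complex-domain estimate
    mag_est = abs(noisy) * mag_mask      # magnitude-domain estimate
    mag = 0.5 * (abs(complex_est) + mag_est)
    return cmath.rect(mag, cmath.phase(complex_est))

clean = fuse_bin(2.0 + 0.0j, 0.5, 0.5 + 0.0j)
```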
SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation
Despite significant progress in Text-to-Image (T2I) generative models, even
lengthy and complex text descriptions still struggle to convey detailed
controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate
realistic and complex scene images from user-specified layouts, has risen to
prominence. However, existing methods transform layout information into tokens
or RGB images for conditional control in the generative process, leading to
insufficient spatial and semantic controllability of individual instances. To
address these limitations, we propose a novel Spatial-Semantic Map Guided
(SSMG) diffusion model that adopts the feature map, derived from the layout, as
guidance. Owing to rich spatial and semantic information encapsulated in
well-designed feature maps, SSMG achieves superior generation quality with
sufficient spatial and semantic controllability compared to previous works.
Additionally, we propose the Relation-Sensitive Attention (RSA) and
Location-Sensitive Attention (LSA) mechanisms. The former aims to model the
relationships among multiple objects within scenes while the latter is designed
to heighten the model's sensitivity to the spatial information embedded in the
guidance. Extensive experiments demonstrate that SSMG achieves highly promising
results, setting a new state-of-the-art across a range of metrics encompassing
fidelity, diversity, and controllability.
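The core idea of conditioning on a layout-derived map rather than tokens can be sketched by rasterizing each instance's box into a spatial grid carrying its semantic label. Representing each instance by an integer class id in a single channel is an illustrative simplification of SSMG's spatial-semantic map.

```python
def layout_to_map(boxes, labels, height, width):
    """Rasterize (x0, y0, x1, y1) boxes into a height x width semantic
    map; later boxes overwrite earlier ones where they overlap."""
    grid = [[0] * width for _ in range(height)]
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = label
    return grid

# One instance of class 7 occupying a 2x2 region of a 4x4 map:
semantic_map = layout_to_map([(1, 1, 3, 3)], [7], 4, 4)
```

Unlike a token sequence, such a map preserves per-pixel location, which is what gives the guidance its spatial controllability.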
Short hybrid polymer/sol-gel silica waveguide switches with high in-device electro-optic coefficient based on photostable chromophore
The highest electro-optic (EO) coefficient to date is achieved in short polymeric directional coupler switches based on hybrid EO polymer/sol-gel silica waveguides. Optimized poling conditions in such waveguides yield a highest in-device EO coefficient of 160 pm/V at 1550 nm using the highly efficient and photostable guest–host EO polymer SEO100. Adiabatic waveguide transitions from the passive sol-gel core to active EO polymer cores surrounding the sol-gel core are shown using EO polymer cores with a coplanar tapered structure. Switching voltages of 8.4 and 10.5 V are achieved for electrodes that are 2.1 and 1.5 mm long, respectively, which are half those of EO switches containing the chromophore AJLS102.